Collect node thread pool usage for shard balancing #131480

DiannaHohensee · 2025-07-17T21:11:46Z

Adds a new transport action to collect usage stats from the
data nodes. ClusterInfoService uses the action to pull thread
pool usage information from the data nodes to the master node
periodically.

Also removes NodeUsageStatsForThreadPoolsCollector as
a plugin interface and replaces it with a single class
implementation. We no longer need a serverless collector
implementation, it can all be done in stateful.

Closes ES-12316

Adds a new transport action to collect usage stats from the data nodes. ClusterInfoService uses the action to pull thread pool usage information from the data nodes to the master node periodically. Also removes NodeUsageStatsForThreadPoolsCollector as an interface/plugin and replaces it with a single class implementation. Closes ES-12316

elasticsearchmachine · 2025-07-17T21:14:22Z

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

mhl-b · 2025-07-18T20:26:35Z

...va/org/elasticsearch/action/admin/cluster/node/usage/NodeUsageStatsForThreadPoolsAction.java

+/**
+ * Defines the request/response types for {@link TransportNodeUsageStatsForThreadPoolsAction}.
+ */
+public class NodeUsageStatsForThreadPoolsAction {


There is a TransportNodesStatsAction which can produce thread-pool usage. Do we need a separate action for this?

The calls to collect the thread pool stats are destructive. For example, collecting the max queue latency seen since the last call and then resetting max seen to zero. Pool utilization is also destructive, resetting an execution time tracker after collection. So we can't hook the new stats up to the TransportNodesStatsAction API and have random callers clearing the state we'll need for allocation.

If we're building a whole separate set of polling messages just because of the destructive-ness of the utilisation read, can we instead just limit the re-calculation to no more often than 30s or something (returning the most recent calculated value until that interval has passed) and have all readers poll from a single monitor? Or instead just switch to having a separate dedicated utilisation polling task on the data node, and make the reads non-destructive.

That sounds like a tidier alternative to me. We can discuss tomorrow at the catch up, but it seems to me like having separate monitoring state is now cascading into more work/maintenance.

can we instead just limit the re-calculation to no more often than 30s or something (returning the most recent calculated value until that interval has passed) and have all readers poll from a single monitor?

If we make it time based, recalculated on the data node every 30 seconds, and then the master node polls every 30 seconds, operations can race such that the master node sees the same value twice and misses a value that is just about to be calculated.

Or instead just switch to having a separate dedicated utilisation polling task on the data node, and make the reads non-destructive.

For this we'd have to build a component on the data node to asynchronously run a thread every 30 seconds to calculate a new value. Not obviously easier than building a TransportAction (which is also already implemented). Transport actions are apparently lightweight and there isn't concern about adding more, FWIW.

There was a discussion on this subject back in the 07-01 meeting (David, Pooya and I) (minutes 10-32). We were thinking EWMA at the time. We did like the idea of using the node stats api. The concern became public documentation of the new values because we were uncertain whether we'd want to change what metrics were sent in future. We can't change publicly visible stats. Arguably the stats we have now are easier to explain, though I'm not sure they're final yet.

I think the new action is more about having a separate action, request/response that we can play around with and include things into that may never go to node-stats. There is also not too much overhead to a new action so I prefer to have it as such. One can argue that node-stats should maybe not be used from cluster-info - rather we should have a dedicated action - but let us not go there now.

henningandersen

LGTM.

Left a few comments, but no need for another round.

henningandersen · 2025-07-30T18:32:45Z

...va/org/elasticsearch/action/admin/cluster/node/usage/NodeUsageStatsForThreadPoolsAction.java

+    public static class Request extends BaseNodesRequest {
+        public Request() {
+            // Send all nodes a request by specifying null.
+            super((String[]) null);


I think we want to send this to indexing nodes only, so might be worth adding the option to specify the nodes here?

I've changed this to take a list of nodes, and limited the given nodes to data nodes, since this is stateful and doesn't know about index nodes.

henningandersen · 2025-07-30T18:33:07Z

...va/org/elasticsearch/action/admin/cluster/node/usage/NodeUsageStatsForThreadPoolsAction.java

+        public Map<String, NodeUsageStatsForThreadPools> getAllNodeUsageStatsForThreadPools() {
+            Map<String, NodeUsageStatsForThreadPools> allNodeUsageStatsForThreadPools = new HashMap<>();
+            for (NodeUsageStatsForThreadPoolsAction.NodeResponse nodeResponse : getNodes()) {
+                // NOMERGE: Is the nodeID in NodeUsageStatsForThreadPools redundant? What is it useful for? If not, remove?


Seems like you want to fix this before merge? I am OK to keep it in the map and in the object, that is quite common.

Yep, I'll remove it. Let's see how this plays out, thanks for giving it a think.

henningandersen · 2025-07-30T18:48:47Z

...va/org/elasticsearch/action/admin/cluster/node/usage/NodeUsageStatsForThreadPoolsAction.java

+/**
+ * Defines the request/response types for {@link TransportNodeUsageStatsForThreadPoolsAction}.
+ */
+public class NodeUsageStatsForThreadPoolsAction {


I think the new action is more about having a separate action, request/response that we can play around with and include things into that may never go to node-stats. There is also not too much overhead to a new action so I prefer to have it as such. One can argue that node-stats should maybe not be used from cluster-info - rather we should have a dedicated action - but let us not go there now.

henningandersen · 2025-07-30T18:51:51Z

server/src/main/java/org/elasticsearch/cluster/NodeUsageStatsForThreadPoolsCollector.java

+        if (clusterState.getMinTransportVersion().onOrAfter(TransportVersions.TRANSPORT_NODE_USAGE_STATS_FOR_THREAD_POOLS_ACTION)) {
+            client.execute(
+                TransportNodeUsageStatsForThreadPoolsAction.TYPE,
+                new NodeUsageStatsForThreadPoolsAction.Request(),


I think we only want to direct this to data nodes or even indexing nodes? No need to contact master or coordination nodes nor search only nodes.

Yes, I wandered down this rabbit hole initially, wondering how to collect a list of only index nodes. The problem is that this code is not in serverless, this is in the main repo: we don't have search vs index nodes :) I was thinking in stateful we would want to contact all the nodes. Though now that I think further, I should be able to filter for data role nodes -- at least not send requests to master role only nodes and such. I'll look into that now.

Down the line, we should have thread pool stats from search nodes for some kind of SearchLoadDecider. In the meantime, all I can think to achieve index-only is to follow up with some kind of override plug for serverless to specially filter the nodes sent here. Or else make all the write load decider logic serverless only, which seems like a shame to do based only on this detail.

henningandersen · 2025-07-30T18:58:04Z

server/src/internalClusterTest/java/org/elasticsearch/cluster/ClusterInfoServiceIT.java

+        assertNotNull("Expected to find stats for the WRITE thread pool", writeThreadPoolStats);
+        assertThat(writeThreadPoolStats.totalThreadPoolThreads(), greaterThan(0));
+        assertThat(writeThreadPoolStats.averageThreadPoolUtilization(), greaterThan(0f));
+        assertThat(writeThreadPoolStats.maxThreadPoolQueueLatencyMillis(), greaterThanOrEqualTo(0L));


It could be good to add a test where we block write-threads to ensure we have a queue latency here. We can do that in a follow-up, this is not essential before merge.

Sure, I flagged it on my JIRA ticket so I don't forget 👍

…ange

DiannaHohensee

Thanks for the review. I've reduced the nodes sent requests from all to only data nodes.

DiannaHohensee · 2025-08-01T17:58:24Z

...va/org/elasticsearch/action/admin/cluster/node/usage/NodeUsageStatsForThreadPoolsAction.java

+    public static class Request extends BaseNodesRequest {
+        public Request() {
+            // Send all nodes a request by specifying null.
+            super((String[]) null);


I've changed this to take a list of nodes, and limited the given nodes to data nodes, since this is stateful and doesn't know about index nodes.

…cking * upstream/main: (166 commits) Reduce inactive sink interval in VectorSimilarityFunctionsIT (elastic#132288) ESQL: Allow agg tests to process many columns (elastic#132358) Update analysis-lowercase-tokenfilter.md (elastic#132359) Add Sparse Vector Index Options Settings to Semantic Text Field (elastic#131058) Collect node thread pool usage for shard balancing (elastic#131480) Add tasks to validate new style transport versions (elastic#131782) Mute org.elasticsearch.search.routing.SearchReplicaSelectionIT testNodeSelection elastic#132354 Mute org.elasticsearch.xpack.esql.action.CrossClusterAsyncQueryIT testBadAsyncId elastic#132353 Fixes DenseVectorFieldIndexTypeUpdateIT release tests (elastic#132346) Fix testCloseOrReallocateDuringPartialSnapshot (elastic#132049) (Doc) ILM Force Merge not on HDD and happens on hosting node not current phase tier (elastic#130280) Run GeoIp YAML tests in multi-project cluster and fix bug discovered by tests (elastic#131521) Unmutes elastic#132111, seems a transient, non reproducible issue (elastic#132253) Mute org.elasticsearch.search.suggest.phrase.PhraseSuggesterIT testPhraseSuggestionWithNgramOnlyAnalyzerThrowsException elastic#132347 Add AI21 support to Inference Plugin (elastic#131238) OpenJDK EA builds should use https instead of http (elastic#132297) ESQL: Normalize timeseries aggs slightly (elastic#132284) Avoid internal server error on suggester ngram bad request (elastic#132321) [ES|QL] Rerank operator improvements (elastic#132318) Mute org.elasticsearch.xpack.logsdb.qa.LogsDbVersusReindexedLogsDbChallengeRestIT testTermsQuery elastic#132337 ...

DiannaHohensee self-assigned this Jul 17, 2025

DiannaHohensee added the >non-issue label Jul 17, 2025

DiannaHohensee requested a review from a team as a code owner July 17, 2025 21:11

DiannaHohensee added :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) Team:Distributed Coordination Meta label for Distributed Coordination team v9.2.0 labels Jul 17, 2025

elasticsearchmachine and others added 2 commits July 17, 2025 21:20

[CI] Auto commit changes from spotless

571ec0a

fix test and remove debug logging

3c6e8fd

DiannaHohensee requested review from nicktindall and mhl-b July 18, 2025 19:18

mhl-b reviewed Jul 18, 2025

View reviewed changes

DiannaHohensee mentioned this pull request Jul 21, 2025

(WIP) Collect node thread pool usage for shard balancing #131249

Closed

DiannaHohensee requested a review from henningandersen July 29, 2025 13:20

henningandersen approved these changes Jul 30, 2025

View reviewed changes

DaveCTurner mentioned this pull request Jul 31, 2025

Add utilization to the THREAD_POOL node stats #132156

Closed

DiannaHohensee added 6 commits July 31, 2025 14:46

Merge branch 'main' into 2025/07/17/new-allocation-transport-action

c28d091

remove NOMERGE

9290acf

Merge branch 'main' into 2025/07/17/new-allocation-transport-action

5f3d76e

send requests only to data nodes; and fix merge conflict with name ch…

5505fab

…ange

undo misc. change that slipped in

f378e34

Merge branch 'main' into 2025/07/17/new-allocation-transport-action

00b9e55

DiannaHohensee commented Aug 1, 2025

View reviewed changes

DiannaHohensee added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Aug 1, 2025

elasticsearchmachine merged commit afc2051 into elastic:main Aug 1, 2025
33 checks passed

DiannaHohensee deleted the 2025/07/17/new-allocation-transport-action branch August 1, 2025 19:24

Collect node thread pool usage for shard balancing #131480

Collect node thread pool usage for shard balancing #131480

Uh oh!

Conversation

DiannaHohensee commented Jul 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

elasticsearchmachine commented Jul 17, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nicktindall Jul 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

henningandersen left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

DiannaHohensee left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

DiannaHohensee commented Jul 17, 2025 •

edited

Loading

nicktindall Jul 23, 2025 •

edited

Loading